SQL Programmer’s workshop


In this article, I’ll be taking you through the practical steps of creating a SQL Server routine, based on the stages I described in my last article, ‘How to write SQL code’. I don’t claim that this is a standard method, since I’ve never come across one. All I can say is that it works for me. I’ll describe a simple routine, purely to emphasize the advantages of a structured approach whatever the size of the project, and to keep the size of this article manageable.

For this exercise, we want to design and build a new routine that searches any text-based column you specify, in any table, for instances of a string. It returns all the ‘hits’, or matches, in their context, to make it easy to scan the content. I’m going to try to make it generally useful, but I’ve been asked to provide it as a tool for actual production use, as part of an admin process for monitoring website content.

1/ Up-front time estimation and planning

I’ve only got six hours I can spend, so I’ll have to cut corners. At this stage, I’ve only a vague idea of what is wanted, and a slight feeling of apprehension, since the result isn’t a subset of the data set. A single tuple could furnish us with several ‘hits’, or matches, and there is no way of telling at this stage whether we need to use a wildcard search. I’ll do a quick Gantt chart to give me a rough idea of progress.

[Figure: Gantt chart of the six-hour time-box]

Obviously, this didn’t take very long at all, since I have a template ready-made. (I use SmartDraw.)

So what do we really want in the results? The amount of data in the system suggests to me that a simple wildcard search will be sufficient. I like creating ‘inversions’ for doing fancy searches such as proximity searches and ‘Google-style’ searches, but this, I reckon, isn’t going to be one of those times. I’ll need to test that assumption too!

Another worry is how to go about making this generic, so I’ll aim for a barebones design and put in extra niceties if I have some time spare after getting the basic system working. In other words, I’m ‘time-boxing’.

2/ Produce the test data, and the automated tests

Before we do anything else, we need some simple data for component-testing. In this case, we can grab some sentences from my first article on ‘How to write T-SQL Code’ and make them into a quick table for testing the various parts of the routine, and we’ll add a few ‘edge’ cases just to catch the obvious errors. This will do for initial component testing, and we can keep it handy if we need to alter anything. We can even store it for ‘assertion-testing’. We’ll tuck it into a view, since that is easier to maintain, but normally we’d probably just use the statement directly to produce the ‘table’.
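The original listing isn’t reproduced here, but a test-data view of this sort might look like the following sketch. The view name, the sentences, and the edge cases are all illustrative, not the originals:

```sql
-- A hypothetical test-data view: a handful of sentences plus a few
-- 'edge' cases, packaged as a view so it is easy to maintain.
CREATE VIEW TestSearchData
AS
SELECT TheKey, TheText
FROM (VALUES
    (1, 'Before you write any code, produce the test data and the tests.'),
    (2, 'A test harness makes component testing quick and repeatable.'),
    (3, 'Keep the test scripts under source control for later use.'),
    (4, ''),                                     -- edge case: empty string
    (5, 'testable'),                             -- edge case: match at the very start
    (6, 'This sentence ends with the word test') -- edge case: match at the end
  ) AS TestData (TheKey, TheText);
```

The `VALUES` row-constructor form needs SQL Server 2008 or later; on earlier versions a series of `SELECT … UNION ALL` statements does the same job.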

Now, with a bit of head-scratching, and the use of a simple RegEx in the SSMS ‘find’ dialog box, we work out that, if we stick rigidly to displaying just fifty characters, the result we need, if we search on the word ‘test’, is this.

We then construct a view that gives the correct result.
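The expected-results view was elided from the text; a sketch of the idea follows. The name, the positions, and the context strings are illustrative, standing in for whatever the RegEx exercise actually produced:

```sql
-- A hypothetical 'expected results' view for assertion testing: for each
-- match of 'test', the key of the source row, the position of the hit,
-- and the fifty-character context, with ellipses marking truncation.
CREATE VIEW ExpectedSearchResult
AS
SELECT TheKey, TheLocation, TheContext
FROM (VALUES
    (1, 40, '...produce the test data and the tests.'),
    (2, 3,  'A test harness makes component testing quick...')
  ) AS Expected (TheKey, TheLocation, TheContext);
```

With expected results stored as a view, an assertion test reduces to comparing the routine’s output against the view and counting the differences.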

This was an interesting exercise, especially as it fleshed out the underlying rules of where to put the ellipsis, and how to get the context. At this stage I toyed with the idea of always starting and ending the context on a word boundary, but shied away from it for the time being. ‘Let’s do it later if there is time,’ I decided. So, component-testing this should be easy: just use our test data and compare the result with what one would expect. We can do a simple ‘assertion test’ using the same method too. We’ll demonstrate how to make a simple test harness in a moment.

The next thing we need to think about is performance and scalability testing. Here we can get acres of text, complete with nasty surprises, from any book. Generally, I prefer a million or so rows to test scalability with. It used to be a lot less before SQL Server 2000, but both hardware and software have improved so much that generating test runs has become trickier. SQL Data Generator copes well, but the text is just a bit too uniform, so a book it must be.

We can read in a whole book. Here is a routine that will take in a text-based book and read it into a global temporary table, a sentence on each line.
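The routine itself didn’t survive extraction, so here is a sketch of one way of doing it, assuming the book is a plain-text file on the server and that the `BULK` permission is available. The file path, table name, and the crude full-stop splitting are all illustrative:

```sql
-- Read a whole text file into a VARCHAR(MAX), then break it into
-- 'sentences', one per row of a global temporary table.
IF OBJECT_ID('tempdb..##TheBook') IS NOT NULL DROP TABLE ##TheBook;
CREATE TABLE ##TheBook (TheKey INT IDENTITY PRIMARY KEY, TheText VARCHAR(MAX));

DECLARE @WholeBook VARCHAR(MAX);
SELECT @WholeBook = BulkColumn
FROM OPENROWSET(BULK 'C:\data\CanterburyTales.txt', SINGLE_CLOB) AS Book;

-- Crudely split the text into 'sentences' on the full stop.
DECLARE @Start INT = 1, @End INT;
WHILE @Start <= LEN(@WholeBook)
  BEGIN
    SET @End = CHARINDEX('.', @WholeBook, @Start);
    IF @End = 0 SET @End = LEN(@WholeBook);
    INSERT INTO ##TheBook (TheText)
      SELECT LTRIM(SUBSTRING(@WholeBook, @Start, @End - @Start + 1));
    SET @Start = @End + 1;
  END;
```

A global (`##`) temporary table is used so the scalability data survives across the batches of the test scripts.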

With this in place, we probably have the necessary tools to create a simple test harness.

We can simply run this batch code just to prove that, unless we write some code to provide the functionality, the test will fail. I know this sounds obvious, but this sort of system catches some very silly common errors.

What we have here is a simple test harness. We just add code into the part that has the comment ‘--do stuff here’ until we stop getting errors. Coding is really that easy!
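The harness was elided from the text, but its pattern, as described, is: run the code under test, compare its output with the stored expected results, and raise an error on any discrepancy. A minimal sketch follows, assuming an expected-results view (here called `ExpectedSearchResult`, an illustrative name) exists:

```sql
-- A minimal assertion-style test harness: fill #Result with the
-- routine's output, then fail loudly if it differs from the
-- expected results in either direction.
DECLARE @Errors INT;
CREATE TABLE #Result (TheKey INT, TheLocation INT, TheContext VARCHAR(50));

/* --do stuff here: fill #Result with the routine's output-- */

SELECT @Errors = COUNT(*)
FROM #Result AS r
  FULL OUTER JOIN ExpectedSearchResult AS e
    ON  r.TheKey = e.TheKey
    AND r.TheLocation = e.TheLocation
    AND r.TheContext = e.TheContext
WHERE r.TheKey IS NULL OR e.TheKey IS NULL;

IF @Errors > 0
  RAISERROR ('Search test failed: %d rows differed from the expected result.',
             16, 1, @Errors);
DROP TABLE #Result;
```

Run as-is, with the ‘do stuff here’ section still empty, the harness fails, which is exactly the point being made above.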

3/ Produce the development harness.

This is going to be quick. We’ll simply use a minimal harness like this. To test it, we’ll just put in three arbitrary timing points, to show that they take a negligible time to execute.
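The harness listing is missing from the text; a sketch of the sort of thing meant is below. It assumes SQL Server 2012’s `LAG` function; on earlier versions a self-join on the log table does the same job:

```sql
-- A minimal timing harness: record the time at arbitrary points in the
-- batch, then report the elapsed milliseconds between successive points.
DECLARE @Log TABLE (TheEvent VARCHAR(40), TheTime DATETIME2 DEFAULT SYSDATETIME());

INSERT INTO @Log (TheEvent) VALUES ('Start');
/* ...first chunk of code under test... */
INSERT INTO @Log (TheEvent) VALUES ('First candidate done');
/* ...second chunk of code under test... */
INSERT INTO @Log (TheEvent) VALUES ('Second candidate done');

SELECT TheEvent,
       DATEDIFF(ms, LAG(TheTime) OVER (ORDER BY TheTime), TheTime) AS [Elapsed ms]
FROM @Log
ORDER BY TheTime;
```

With nothing between the timing points, every elapsed figure should be at or near zero, which is the check described above.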

4/ Determine the likely best algorithms, assemble the candidates

So we decide we are going to need a wildcard search (LIKE and PATINDEX) to find the strings; otherwise the routine won’t be much use. We won’t do a proximity search (two or more words in proximity within a string), since this requires a CLR RegEx function, and we’re trying to keep things simple.

The first problem we hit is that a string can be found several times in a column. That makes it tricky to get the result with a join. You could use a number table, but you’d then have to iterate through the rows. If we can’t do a simple join, then there is going to be iteration somewhere, and it must be kept to a minimum.

I’ve now got to decide how generic I’d like this to be. Every programmer’s instinct is to make what they write elegant and generic. Here we have an immediate problem, in that our code assumes that the primary key of the table whose string-based column we’re searching is an integer. You’ll never win the battle to store primary keys in a generic way. We’ll just have to restrict ourselves to assuming an integer primary key or unique index. Generally, they seem to be!

What sort of routine are we aiming for? I could go for something that searches all possible rows in a database like this one, but here we only want to search a few columns from a few tables. I decide that a stored procedure is fine for doing this.

What sort of output do we want? A result? An output variable? Are we outputting a SQL Server result, XML, or an XHTML fragment?

5/ Sketch out the candidates

The first principle we need to stick to is to access the table being searched as little as possible, so we’ll try to do it just once; when testing out ideas, it is worth checking the execution plan to make sure that this is happening. Using the test data view, we probably want to just scoop out the likely candidates. We use PATINDEX to find those rows that contain the word ‘test’ and record its first occurrence.
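The ‘scoop’ query itself was elided; the following sketch shows the shape of it, using the view and column names from the earlier illustrations (which are assumptions, not the originals):

```sql
-- A single pass over the test-data view: pick up every row containing
-- the search string, along with the position of its first match.
DECLARE @SearchTerm VARCHAR(100) = '%test%';

SELECT TheKey, TheText,
       PATINDEX(@SearchTerm, TheText) AS FirstMatch
INTO #Candidates
FROM TestSearchData
WHERE PATINDEX(@SearchTerm, TheText) > 0;
```

One scan of the source gives us both the candidate rows and the location of the first hit in each, which is the ‘so far, so good’ result described next.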

A quick check shows us that ‘so far, so good’. We’ve scooped up the rows we want in one pass and determined the location of the first matches of the wildcard ‘%test%’ at the same time.

I try out a few ideas for listing all the matches from the rows of the table. There seem to be two likely candidates. We’ll test them out.

6/ Run preliminary performance and scalability tests

Is this SQL that we’ve written any more efficient than a simpler version such as this?

We pop them both into the development harness to check, using the ‘Canterbury Tales’ to test on.

It turns out that they have almost identical timings, at 250 ms, and a quick look at the execution plans shows why: they generate identical plans on our test data. We can safely use the simpler version for the time being.

We then turn our attention to producing the matches.

We can take the procedural approach

I developed this with the small result set, checking against the test result, and then searched for the substring ‘body’ in the Canterbury Tales. Hmm, quick. Even the iteration part takes only 70 ms with a reasonably small result.
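The procedural version didn’t survive extraction; a sketch of the general approach is below, iterating through the candidate rows (here, a `#Candidates` table from the earlier scoop, an illustrative name) and finding each successive match with PATINDEX:

```sql
-- A procedural approach: for every candidate row, repeatedly find the
-- next match, record its position, and chop off the searched portion.
DECLARE @SearchTerm VARCHAR(100) = '%test%';
DECLARE @Matches TABLE (TheKey INT, TheLocation INT);
DECLARE @Key INT, @Text VARCHAR(MAX), @Offset INT, @Hit INT;

DECLARE MatchCursor CURSOR LOCAL FAST_FORWARD FOR
  SELECT TheKey, TheText FROM #Candidates;
OPEN MatchCursor;
FETCH NEXT FROM MatchCursor INTO @Key, @Text;
WHILE @@FETCH_STATUS = 0
  BEGIN
    SET @Offset = 0;
    SET @Hit = PATINDEX(@SearchTerm, @Text);
    WHILE @Hit > 0
      BEGIN
        INSERT INTO @Matches (TheKey, TheLocation)
          VALUES (@Key, @Offset + @Hit);         -- position in the original text
        SET @Offset = @Offset + @Hit;             -- account for what we discard
        SET @Text = SUBSTRING(@Text, @Hit + 1, LEN(@Text));
        SET @Hit = PATINDEX(@SearchTerm, @Text);  -- next match in the remainder
      END;
    FETCH NEXT FROM MatchCursor INTO @Key, @Text;
  END;
CLOSE MatchCursor; DEALLOCATE MatchCursor;
```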

We can do better. If we use this algorithm, we get the second part down to 6 ms, measured with the test harness. The scan of the entire Canterbury Tales took only 240 ms.
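The faster algorithm was elided, but a set-based formulation along the lines mentioned earlier (a number table) would look something like this sketch. It assumes a `Numbers` table of sequential integers covering the longest document, and reuses the illustrative `#Candidates` table:

```sql
-- A set-based approach: every position n at which the substring matches
-- becomes a row, all in a single statement, with no explicit iteration.
DECLARE @SearchTerm VARCHAR(100) = 'test';

SELECT c.TheKey, n.Number AS TheLocation
FROM #Candidates AS c
  JOIN Numbers AS n
    ON n.Number <= LEN(c.TheText) - LEN(@SearchTerm) + 1
WHERE SUBSTRING(c.TheText, n.Number, LEN(@SearchTerm)) = @SearchTerm
ORDER BY c.TheKey, TheLocation;
```

The iteration hasn’t vanished, of course; it has been handed to the query engine via the join, which is generally where it belongs.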

In doing this exercise, I’ve noticed how bad the context search strings look with returns and linefeeds in them, and how silly the margin looks in the text file. I also wonder whether they’d be improved by showing exactly where in the displayed string the match took place. If I get time, I’ll go back and improve that, but for the time being I’ll press on and get something running, since with the test and dev harness worked out, this can be done quickly.

7/ Build

We are now at the point where we can be reasonably confident that the routine is working right, but we can save the test scripts we’ve done in case we find a bug and have to revert, or if we have time for improvements.

I like to delay creating a procedure or function for as long as I can, just because it is so much easier to do component testing on the various component blocks of logic (e.g. WHILE loops) and expressions, and it is good to be able to look at intermediate results.

At this stage, I felt confident enough that things were working well enough to encapsulate the logic in the stored procedure.

I’d soon popped in comments and a header, improved the variable names here and there, and generally tidied it up. I parameterized all the obvious things that would benefit from being parameterized. I turned it into a generic routine that would work for any table. A glance at the clock showed me I was well ahead of the allotted time. I was still niggled by the fact that the context of the match was wrong. I’d decided that it would be OK to break the context anywhere rather than at a word boundary, and the job that required the code could be done with what I’d delivered, but it looked tatty, so I redid the code so that, instead of …

It looked like…

The next part of the build process was to test it out on a number of databases and tables. As is often the case, each new table I tried it on threw up a new difficulty. Although I’d put in a few diagnostics, I wished at this stage that I’d done a few more, as it would have sped the process up. I had a lot of fun with Chaucer’s ‘The Canterbury Tales’. It is always a surprise to find such words in the classics.

8/ Unit Test

Whilst doing bug-fixes in the light of testing, on various sizes and types of text-based columns, I did a simple unit test to make sure I’d not broken anything. This is so unobtrusive that it can be added to the build script, or used for regular checks. Although I’ve tried to pretend that these phases of developing a procedure are distinct, there is actually quite a bit of leaping back and forth between phases in response to finding a bug, or improving the result. However, in the Unit Test phase, it is now time to set up at least one simple automated test to run whenever you make a change. Here is the one I used, based on the test harness I showed you earlier.
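The test itself was elided from the text; a sketch on the same pattern as the earlier harness follows. The procedure name, its parameters, and the expected-results view are all illustrative assumptions:

```sql
-- A sketch of an automated unit test: run the procedure against the
-- test data and assert that its output matches the expected results.
CREATE TABLE #ActualResult (TheKey INT, TheLocation INT, TheContext VARCHAR(50));

INSERT INTO #ActualResult
  EXEC SearchTextColumn @TableName  = 'TestSearchData',
                        @ColumnName = 'TheText',
                        @SearchTerm = 'test';

IF EXISTS (SELECT *
           FROM #ActualResult AS a
             FULL OUTER JOIN ExpectedSearchResult AS e
               ON  a.TheKey = e.TheKey
               AND a.TheLocation = e.TheLocation
               AND a.TheContext = e.TheContext
           WHERE a.TheKey IS NULL OR e.TheKey IS NULL)
  RAISERROR ('Unit test failed: search results differ from expected.', 16, 1);
ELSE
  PRINT 'Unit test passed.';

DROP TABLE #ActualResult;
```

Being a single batch with a pass/fail outcome, it can sit quietly at the end of the build script and run on every change.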

I also ran a number of checks on a number of databases I have lying around to make sure that nothing else is obviously broken.

9/ Check-in

I like the finality of the check-in. Time’s up, and I’m running a time-box on this routine. The paint is still a little bit wet, but it is time to nip out to the pub. It is pointless to describe the nuts and bolts of check-in: it is just a handy way to end an article. You’ll probably be using SQL Source Control, a piece of software that is close to my heart. As a punishment, I’m forcing myself to use raw Git, so I can encourage others to improve some of my work, and start to do some collaborative work! The current version of the stored procedure can be downloaded at the bottom of the article.

Here is the state of play with the routine.

Conclusions

There is a certain terror in disclosing the processes behind coming up with a T-SQL routine. It is much cosier to pop up with the finished T-SQL as if one wrote it like a Shakespearian sonnet. That belies the actual errors, dead ends, and frustrations that are part of the process, which is probably why it is rare to see the process explained in detail. This article will be pretty meaningless unless you have read the first article in this series, How to develop T-SQL Code. Once again, I’d caution you that this method works for me, but you’ll probably find that every SQL developer has a different way. Hopefully, this article will encourage some other accounts of how to be a productive SQL programmer.


About the author

Phil Factor


Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 40 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career. See also :
